arXiv:1507.06803v1 [cs.NE] 24 Jul 2015




A Neighbourhood-Based Stopping Criterion for 
Contrastive Divergence Learning 

Enrique Romero Merino, Ferran Mazzanti Castrillejo and Jordi Delgado Pin


Abstract — Restricted Boltzmann Machines (RBMs) are general 
unsupervised learning devices to ascertain generative models of 
data distributions. RBMs are often trained using the Contrastive 
Divergence learning algorithm (CD), an approximation to the 
gradient of the data log-likelihood. A simple reconstruction error 
is often used as a stopping criterion for CD, although several 
authors [1], [2] have raised doubts concerning the feasibility 
of this procedure. In many cases the evolution curve of the 
reconstruction error is monotonic while the log-likelihood is 
not, thus indicating that the former is not a good estimator 
of the optimal stopping point for learning. However, not many 
alternatives to the reconstruction error have been discussed in the 
literature. In this manuscript we investigate simple alternatives 
to the reconstruction error, based on the inclusion of information 
contained in neighboring states to the training set, as a stopping 
criterion for CD learning. 


I. Introduction 

Learning algorithms for deep multilayer neural networks 
have been known for a long time [3], though they usually 
could not outperform simpler, shallow networks. In this way, 
deep multilayer networks were not widely used to solve 
large scale real-world problems until the last decade [4]. 
In 2006, Deep Belief Networks (DBNs) [5] came out as a 
real breakthrough in this field, since the learning algorithms 
proposed ended up being a feasible and practical method to 
train deep networks, with spectacular results [6]-[9]. DBNs 
have Restricted Boltzmann Machines (RBMs) [10] as their 
building blocks. 

RBMs are topologically constrained Boltzmann Machines 
(BMs) with two layers, one of hidden and another of visible 
neurons, and no intralayer connections. This property makes 
working with RBMs simpler than with regular BMs, and 
in particular the stochastic computation of the log-likelihood 
gradient may be performed more efficiently by means of Gibbs 
sampling [4], [11]. 

In 2002, the Contrastive Divergence (CD) learning algorithm
was proposed as an efficient training method for
product-of-experts models, of which RBMs are a special case [12]. It
was observed that using CD to train RBMs worked quite well
in practice. This fact was important for deep learning since
some authors suggested that a multilayer deep neural network
is better trained when each layer is pre-trained separately as
if it were a single RBM [6], [7], [13]. Thus, training RBMs
with CD and stacking them up seems to be a good way to go
when designing deep learning architectures.

However, the picture is not as nice as it looks, since CD
is not a flawless training algorithm. Although CD is an
approximation of the true log-likelihood gradient [14], it is
biased and it may not converge in some cases [15]-[17].
Moreover, it has been observed that CD, and variants such
as Persistent CD [18] or Fast Persistent CD [19], can lead
to a steady decrease of the log-likelihood during learning
[2], [20]. Therefore, the risk of learning divergence imposes
the requirement of a stopping criterion. There are two main
methods used to decide when to stop the learning process. One
is based on monitoring the reconstruction error [21].
The other is based on the estimation of the log-likelihood
with Annealed Importance Sampling (AIS) [22], [23]. The
reconstruction error is easy to compute and has often been
used in practice, though its adequacy remains unclear because
it typically decreases monotonically [2]. AIS seems to work
better than the reconstruction error in most cases, though it is
considerably more expensive to compute, and may also fail [1].

In this work we approach this problem from a completely
different perspective. Based on the fact that the energy is a
continuous and smooth function of its variables, the close
neighborhood of the high-probability states is also expected to
acquire a significant amount of probability. In this sense,
we argue that the information contained in the neighborhood
of the training data is valuable, and that it can be incorporated
in the learning process of RBMs. In particular, we propose
to use it to monitor the log-likelihood of the
model by means of a new quantity that depends on the
information contained in the training set and its neighbors.
Furthermore, in order to make it computationally tractable,
we build it in such a way that it becomes independent of the
partition function of the model. In this way, we propose a
neighborhood-based stopping criterion for CD and show its
performance on several data sets.

II. Learning in Restricted Boltzmann Machines 
A. Energy-based Probabilistic Models 

Energy-based probabilistic models define a probability
distribution from an energy function, as follows:

$$P(x,h) = \frac{e^{-\mathrm{Energy}(x,h)}}{Z} , \tag{1}$$

where $x$ and $h$ stand for (typically binary) visible and hidden
variables, respectively. The normalization factor $Z$ is called
the partition function and reads

$$Z = \sum_{x,h} e^{-\mathrm{Energy}(x,h)} . \tag{2}$$


Since only $x$ is observed, one is interested in the marginal
distribution

$$P(x) = \frac{\sum_{h} e^{-\mathrm{Energy}(x,h)}}{Z} , \tag{3}$$

but the evaluation of the partition function $Z$ is computationally
prohibitive since it involves an exponentially large number
of terms. In this way, one cannot measure $P(x)$ directly.
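
To make the cost of (2) and (3) concrete, the following minimal sketch enumerates every configuration of a toy model. The bilinear energy `toy_energy`, the matrix `A` and the helper names are illustrative assumptions, not taken from the text. The enumeration visits $2^{n_{visible}+n_{hidden}}$ terms, which is why it is only usable for very small models.

```python
import itertools
import numpy as np

def all_binary_states(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)

def exact_marginal(energy_fn, n_visible, n_hidden):
    """Brute-force log Z from (2) and P(x) from (3) for a tiny model."""
    X = all_binary_states(n_visible)
    H = all_binary_states(n_hidden)
    # neg_energy[i, j] = -Energy(X[i], H[j])
    neg_energy = np.array([[-energy_fn(x, h) for h in H] for x in X])
    log_Z = np.logaddexp.reduce(neg_energy.ravel())
    log_px = np.logaddexp.reduce(neg_energy, axis=1) - log_Z
    return X, np.exp(log_px), log_Z

# Hypothetical toy energy, chosen only for illustration.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))

def toy_energy(x, h):
    return -(h @ A @ x)

X, px, log_Z = exact_marginal(toy_energy, n_visible=3, n_hidden=2)
n_terms = 2 ** (3 + 2)  # number of terms summed in Z
```

Doubling the number of units squares the number of terms, so this direct approach stops being practical long before realistic model sizes are reached.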

The energy function depends on several parameters $\theta$, which
are adjusted at the learning stage. This is done by maximizing
the likelihood of the data. In energy-based models, the
derivative of the log-likelihood can be expressed as

$$\frac{\partial \log P(x;\theta)}{\partial \theta} = -E_{P(h|x)}\left[\frac{\partial \mathrm{Energy}(x,h)}{\partial \theta}\right] + E_{P(\tilde{x},h)}\left[\frac{\partial \mathrm{Energy}(\tilde{x},h)}{\partial \theta}\right] , \tag{4}$$

where the first term is called the positive phase and the second
term the negative phase. As with (3), the exact computation
of the derivative of the log-likelihood is usually unfeasible
because of the negative phase in (4), which comes from the
derivative of the partition function.


B. Restricted Boltzmann Machines 

Restricted Boltzmann Machines are energy-based
probabilistic models whose energy function is:

$$\mathrm{Energy}(x,h) = -b^{T}x - c^{T}h - h^{T}Wx . \tag{5}$$

RBMs are at the core of DBNs [5] and other deep architectures
that use RBMs for unsupervised pre-training previous to the
supervised step [6], [7], [13].

The consequence of the particular form of the energy
function is that in RBMs both $P(h|x)$ and $P(x|h)$ factorize.
In this way it is possible to compute $P(h|x)$ and $P(x|h)$
in one step, making it possible to perform Gibbs sampling
efficiently, in contrast to more general models like Boltzmann
Machines [24].
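
The factorization can be checked numerically on a small RBM. The sketch below assumes binary units and the convention that `W` has shape (hidden, visible), so that the energy is $-b^{T}x - c^{T}h - h^{T}Wx$; the function names and random parameters are illustrative. It compares the product of per-unit sigmoids with $P(h|x)$ obtained by brute force from the energy, and then performs one Gibbs step.

```python
import itertools
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_h_given_x(x, W, c):
    """P(h_j = 1 | x) for every hidden unit j."""
    return sigmoid(c + W @ x)

def p_x_given_h(h, W, b):
    """P(x_i = 1 | h) for every visible unit i."""
    return sigmoid(b + W.T @ h)

def energy(x, h, W, b, c):
    return -(b @ x) - (c @ h) - (h @ W @ x)

rng = np.random.default_rng(1)
n_visible, n_hidden = 4, 3
W = rng.normal(size=(n_hidden, n_visible))
b = rng.normal(size=n_visible)
c = rng.normal(size=n_hidden)
x = np.array([1.0, 0.0, 1.0, 1.0])

# Brute-force P(h | x) from the energy, to compare with the factorized form.
H = np.array(list(itertools.product([0, 1], repeat=n_hidden)), dtype=float)
weights = np.exp([-energy(x, h, W, b, c) for h in H])
brute = weights / weights.sum()
q = p_h_given_x(x, W, c)
factorized = np.prod(np.where(H == 1, q, 1 - q), axis=1)
max_gap = np.abs(brute - factorized).max()

# One Gibbs step x -> h -> x', each layer sampled in a single vectorized draw.
h_sample = (rng.random(n_hidden) < q).astype(float)
x_next = (rng.random(n_visible) < p_x_given_h(h_sample, W, b)).astype(float)
```

Because each layer is conditionally independent given the other, a whole layer is sampled with one matrix product instead of one unit at a time.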


C. Contrastive Divergence 

The most common learning algorithm for RBMs uses an
algorithm to estimate the derivative of the log-likelihood of a
Product of Experts model. This algorithm is called Contrastive
Divergence [12].

Contrastive Divergence CD$_n$ estimates the derivative of the
log-likelihood for a given point $x$ as

$$\frac{\partial \log P(x;\theta)}{\partial \theta} \simeq -E_{P(h|x)}\left[\frac{\partial \mathrm{Energy}(x,h)}{\partial \theta}\right] + E_{P(h|x_n)}\left[\frac{\partial \mathrm{Energy}(x_n,h)}{\partial \theta}\right] , \tag{6}$$

where $x_n$ is the last sample from the Gibbs chain starting from
$x$ obtained after $n$ steps:

$$h_1 \sim P(h|x)$$
$$x_1 \sim P(x|h_1)$$
$$\cdots$$
$$h_n \sim P(h|x_{n-1})$$
$$x_n \sim P(x|h_n) .$$

Usually, $E_{P(h|x)}\left[\frac{\partial \mathrm{Energy}(x,h)}{\partial \theta}\right]$ can be easily computed.

Several alternatives to CD$_n$ are Persistent CD [18], Fast
Persistent CD [19] or Parallel Tempering [20].
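
As an illustration of the Gibbs chain above, here is a minimal sketch of a CD$_n$ gradient estimate for a binary RBM, averaged over a batch. The function name `cd_n_gradient`, the (hidden, visible) shape of `W`, the random toy data and the learning rate are assumptions made for the example. Plain gradient ascent is used; momentum is omitted for brevity.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cd_n_gradient(X, W, b, c, n, rng):
    """CD-n estimate of the log-likelihood gradient, averaged over the rows of X."""
    ph_data = sigmoid(X @ W.T + c)            # E[h | x], used in the positive phase
    Xk = X
    for _ in range(n):                        # n steps of the Gibbs chain started at x
        Hk = (rng.random(ph_data.shape) < sigmoid(Xk @ W.T + c)).astype(float)
        Xk = (rng.random(X.shape) < sigmoid(Hk @ W + b)).astype(float)
    ph_model = sigmoid(Xk @ W.T + c)          # E[h | x_n], used in the negative phase
    m = X.shape[0]
    dW = (ph_data.T @ X - ph_model.T @ Xk) / m
    db = (X - Xk).mean(axis=0)
    dc = (ph_data - ph_model).mean(axis=0)
    return dW, db, dc

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(20, 6)).astype(float)   # toy binary data
W = 0.01 * rng.normal(size=(4, 6))
b = np.zeros(6)
c = np.zeros(4)
learning_rate = 0.1                                   # illustrative value
for _ in range(5):
    dW, db, dc = cd_n_gradient(X, W, b, c, n=1, rng=rng)
    W += learning_rate * dW
    b += learning_rate * db
    c += learning_rate * dc
```

For the RBM energy, the derivative of the energy with respect to `W` is $-hx^{T}$, so the two expectations in the estimate reduce to the outer products computed in `dW`.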


D. Monitoring the Learning Process in RBMs 

Learning in RBMs is a delicate procedure involving a lot of 
data processing that one seeks to perform at a reasonable speed 
in order to be able to handle large spaces with a huge amount 
of states. In doing so, drastic approximations that can only be 
understood in a statistically averaged sense are performed [25]. 

One of the most relevant points to consider at the learning
stage is to find a good way to determine whether a good
solution has been found or not, and so to decide when the
learning process should stop. One of the most widely used
criteria for stopping is based on monitoring the
reconstruction error, which is a measure of the capability of the
network to produce an output that is consistent with the data at
input. Since RBMs are probabilistic models, the reconstruction
error of a data point $x^{(i)}$ is computed from the probability of $x^{(i)}$
given the expected value of $h$ for $x^{(i)}$:

$$R(x^{(i)}) = -\log P\left(x^{(i)} \mid E\left[h \mid x^{(i)}\right]\right) , \tag{7}$$

which is a probabilistic extension of the sum-of-squares
reconstruction error for deterministic networks

$$\mathcal{E}(x^{(i)}) = \left\| x^{(i)} - \hat{x}^{(i)} \right\|^2 , \tag{8}$$

where $\hat{x}^{(i)}$ denotes the output produced by the network for
the input $x^{(i)}$.


Some authors have shown that, in some cases, learning
induces an undesirable decrease in likelihood that goes
undetected by the reconstruction error [1], [2]. It has been shown
[2] that the reconstruction error defined in (7) usually
decreases monotonically. Since no increase in the reconstruction
error takes place during training, there is no apparent way to
detect the change of behavior of the log-likelihood for CD$_n$.
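
A minimal sketch of how both reconstruction errors might be computed for a binary RBM follows. It assumes that the network output used in (8) is the mean-field reconstruction, that is, the visible probabilities given the expected hidden activations; this choice, the function names and the small constant `eps` are assumptions of the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def reconstruction_errors(X, W, b, c, eps=1e-12):
    """Per-example errors: (7) as a cross-entropy and (8) as a squared distance."""
    h_mean = sigmoid(X @ W.T + c)     # expected value of h given each x
    p = sigmoid(h_mean @ W + b)       # P(x_i = 1 | expected h)
    # (7): minus the log-probability of x under independent Bernoulli units
    probabilistic = -(X * np.log(p + eps) + (1 - X) * np.log(1 - p + eps)).sum(axis=1)
    # (8): squared distance to the mean-field reconstruction (assumed output)
    squared = ((X - p) ** 2).sum(axis=1)
    return probabilistic, squared

# With all parameters at zero every reconstructed probability is 0.5.
n_visible, n_hidden = 6, 4
X = np.array([[1, 0, 1, 1, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
W = np.zeros((n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
probabilistic, squared = reconstruction_errors(X, W, b, c)
```

Both quantities depend only on one up-down pass through the network, which is why they are cheap to track but carry no information about how probability mass is distributed over the rest of the state space.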


III. Proposed Stopping Criterion 


The proposed stopping criterion is based on monitoring
the ratio of two quantities: the geometric average of the
probabilities of the training set, and the sum of probabilities
of points in a given neighbourhood of the training set. More
formally, what we monitor is

$$\xi_d = \frac{\left( \prod_{i=1}^{N} P(x^{(i)}) \right)^{1/N}}{\frac{1}{|D|} \sum_{j \in D} P(y^{(j)})} , \tag{9}$$

where $N$ is the number of training points $x^{(i)}$ and $D$ is a
subset of points $y^{(j)}$ at a Hamming distance from the
training set less than or equal to $d$. The idea behind the definition
is that the evolution of $\xi_d$ at the learning stage is expected to
closely resemble that of the log-likelihood for certain values of
$d$ and $D$. For that reason we propose as the stopping criterion
to find the maximum of $\xi_d$, which will be close to the one
shown by the log-likelihood of the data, as shown by the
experiments in the next sections.

The reason for that is twofold. On the one hand, the numerator
and denominator monitor different things. The numerator,
which is essentially the likelihood of the data, is sensitive to
the accumulation of most of the probability mass by a reduced
subset of the training data, a typical feature of CD$_n$. For
continuity reasons, the denominator is strongly correlated with
the sum of probabilities of the training data. Once the problem
has been learnt, the probabilities in a close neighborhood of the
training set will be high. The value of $\xi_d$ results from a delicate
equilibrium between these two quantities (see section IV),
which we propose to use as a stopping criterion for learning.
On the other hand, due to the structure of $\xi_d$, the partition
functions $Z$ involved in both the numerator and denominator
cancel out, which is a necessary condition in the design of the
quantity being monitored. In other words, $\xi_d$ can be
equivalently defined as

$$\xi_d = \frac{\left( \prod_{i=1}^{N} \sum_{h} e^{-\mathrm{Energy}(x^{(i)},h)} \right)^{1/N}}{\frac{1}{|D|} \sum_{j \in D} \sum_{h} e^{-\mathrm{Energy}(y^{(j)},h)}} . \tag{10}$$


The particular topology of RBMs allows
$\sum_{h} e^{-\mathrm{Energy}(x,h)}$ to be computed efficiently. This fact
dramatically decreases the computational cost involved in the
calculation, which would otherwise become unfeasible in most
real-world problems where RBMs could be successfully applied.
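
The partition-function-free form of the criterion lends itself to a short implementation. The sketch below is one possible way to evaluate it for a binary RBM, working with logarithms for numerical stability; since the logarithm is monotonic, the maximum of the logarithm is located at the same point as the maximum of the ratio itself. The closed form used for the sum over hidden states, $b^{T}x + \sum_j \log(1 + e^{c_j + W_j x})$, follows from the RBM energy; the function names, the (hidden, visible) shape of `W` and the toy index sets are assumptions of the example.

```python
import numpy as np

def log_unnorm_marginal(X, W, b, c):
    """log sum_h exp(-Energy(x, h)) for each row x of X, in closed form."""
    # The sum over binary h factorizes into a product over hidden units.
    return X @ b + np.logaddexp(0.0, X @ W.T + c).sum(axis=1)

def log_xi(X_train, Y_neigh, W, b, c):
    """Logarithm of the monitored ratio; the partition function never appears."""
    log_num = log_unnorm_marginal(X_train, W, b, c).mean()      # log of geometric mean
    log_den = (np.logaddexp.reduce(log_unnorm_marginal(Y_neigh, W, b, c))
               - np.log(len(Y_neigh)))                          # log of arithmetic mean
    return log_num - log_den

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 3
W = rng.normal(size=(n_hidden, n_visible))
b = rng.normal(size=n_visible)
c = rng.normal(size=n_hidden)
codes = np.arange(2 ** n_visible)[:, None]
all_x = ((codes >> np.arange(n_visible)) & 1).astype(float)
train_idx = [3, 5, 12]                       # hypothetical training states
neigh_idx = [1, 3, 4, 5, 7, 12, 13]          # hypothetical neighbourhood D
X_train, Y_neigh = all_x[train_idx], all_x[neigh_idx]

# Cross-check with normalized probabilities, using the exact Z of this tiny model.
log_p = log_unnorm_marginal(all_x, W, b, c)
log_p = log_p - np.logaddexp.reduce(log_p)
xi_from_probs = np.exp(log_p[train_idx].mean()) / np.exp(log_p[neigh_idx]).mean()
xi_from_energies = np.exp(log_xi(X_train, Y_neigh, W, b, c))
```

The two values agree because the factor $1/Z$ appears once in the numerator and once in the denominator.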

While the numerator in $\xi_d$ is directly evaluated from the data
in the training set, the problem of finding suitable values for
$y^{(j)}$ still remains. Indeed, the set of points at a given Hamming
distance $d$ from the training set is independent of the weights
and biases of the network. In this way, it can be built once at
the very beginning of the process and be used as required
during learning. Therefore, two issues have to be sorted out
before the criterion can be applied. The first one is to decide 
a suitable value of d. Experiments with different problems 
show that this is indeed problem dependent, as is illustrated in 
the experimental section below. The second one is the choice 
of the subset D , which strongly depends on the size of the 
space being explored. For small spaces one can safely use the 
complete set of points at a distance less than or equal to d, but 
that can be forbiddingly large in real world problems. For this 
reason we explore two possibilities: one including all points 
and another including only a random subset of the same size 
as the training set, which is only as expensive as dealing with 
the training set. 
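
One possible way to build these neighbourhoods once, before training, is to flip up to $d$ bits of every training point and discard states already found at a smaller distance. The sketch below does this for a toy training set and then draws a random subset of the same size as the training set. The function name `hamming_shells`, the toy data and the uniform sampling scheme are assumptions of the example; the text does not specify how the random subset is drawn.

```python
import itertools
import numpy as np

def hamming_shells(X_train, d_max):
    """shells[d] = set of states whose minimum Hamming distance to X_train is exactly d."""
    n_bits = X_train.shape[1]
    seen, shells = set(), []
    for d in range(d_max + 1):
        current = set()
        for x in X_train:
            for idx in itertools.combinations(range(n_bits), d):
                y = x.copy()
                y[np.array(idx, dtype=int)] ^= 1   # flip the chosen d bits
                current.add(tuple(y.tolist()))
        current -= seen       # drop states already found at a smaller distance
        seen |= current
        shells.append(current)
    return shells

X_train = np.array([[0, 0, 0, 0], [1, 1, 1, 1]])   # toy training set
shells = hamming_shells(X_train, d_max=2)
shell_sizes = [len(s) for s in shells]

d = 1
within_d = sorted(set().union(*shells[: d + 1]))   # all states at distance <= d
rng = np.random.default_rng(0)
pick = rng.choice(len(within_d), size=len(X_train), replace=False)
random_subset = np.array(within_d)[pick]           # same size as the training set
```

The cost of this construction grows with the number of training points times the number of $d$-bit flips, so it remains usable when enumerating the full state space is not.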


IV. Experiments 

We performed several experiments to explore the criterion
defined in section III and study the behavior of
$\xi_d$ in comparison with the log-likelihood and the reconstruction
error of the data in several problems. We have explored
problems of a size such that the log-likelihood can be exactly
evaluated and compared with the proposed $\xi_d$ parameter.
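
For problems of this size the exact log-likelihood can be obtained by enumerating the visible states only, because the sum over hidden states has a closed form in a binary RBM. The following sketch shows one way to do it; the function names and the choice of reporting the mean log-likelihood per training example are assumptions of the example.

```python
import numpy as np

def log_unnorm_marginal(X, W, b, c):
    """log sum_h exp(-Energy(x, h)) for each row x of X."""
    return X @ b + np.logaddexp(0.0, X @ W.T + c).sum(axis=1)

def exact_mean_log_likelihood(X_train, W, b, c):
    """Mean of log P(x) over the training set, with Z summed over all visible states."""
    n_visible = W.shape[1]
    codes = np.arange(2 ** n_visible)[:, None]
    all_x = ((codes >> np.arange(n_visible)) & 1).astype(float)
    log_Z = np.logaddexp.reduce(log_unnorm_marginal(all_x, W, b, c))
    return (log_unnorm_marginal(X_train, W, b, c) - log_Z).mean()

# Sanity check: with all parameters at zero the model is uniform over 2**10 states.
n_visible, n_hidden = 10, 10
W = np.zeros((n_hidden, n_visible))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
X_train = np.zeros((3, n_visible))
ll = exact_mean_log_likelihood(X_train, W, b, c)
```

The enumeration has $2^{N_v}$ rows, which is affordable for 16 or 19 visible units but not for much larger inputs.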

The first problem, denoted Bars and Stripes (BS), tries to
identify vertical and horizontal lines in 4×4 pixel images. The
training set consists of the whole set of images containing all
possible horizontal or vertical lines (but not both), ranging
from no lines (blank image) to completely filled images
(black image), thus producing $2 \times 2^4 - 2 = 30$ different
images (avoiding the repetition of the fully black and fully white
images) out of the space of $2^{16}$ possible images with black
or white pixels. The second problem, named Labeled Shifter
Ensemble (LSE), consists of learning 19-bit states formed as
follows: given an initial 8-bit pattern, generate three new states
by concatenating to it the bit sequences 001, 010 or 100. The final
8-bit pattern of the state is the original one shifted one bit to
the left if the intermediate code is 001, copied unchanged
if the code is 010, or shifted one bit to the right if the code
is 100. One thus generates the training set using all possible
$2^8 \times 3 = 768$ states that can be created in this form, while
the system space consists of all possible $2^{19}$ different states
one can build with 19 bits. These two problems have already
been explored in [2] and are adequate in the current context
since, while still large, the dimensionality of the space allows for
a direct monitoring of the partition function and the
log-likelihood during learning. For the sake of completeness, we
have also tested the proposed criterion on randomly generated
problems with different space dimensions, where the number
of states to be learnt is significantly smaller than the size
of the space. In particular, we have generated four different
data sets (RAN10, RAN12, RAN14 and RAN16) consisting
of $N_v = 10, 12, 14, 16$ binary input units and $2^{N_v/2}$ examples
to be learnt, as suggested in [26].
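
The two structured data sets can be generated with a few lines of code. The sketch below is one possible construction. It assumes that the shift in the LSE problem is cyclic (implemented with `np.roll`), which the text does not state; the function names are illustrative as well.

```python
import itertools
import numpy as np

def bars_and_stripes(side=4):
    """All images made of full horizontal lines only or full vertical lines only."""
    patterns = set()
    for mask in itertools.product([0, 1], repeat=side):
        m = np.array(mask)
        rows = np.tile(m[:, None], (1, side))   # row r is filled when mask[r] == 1
        cols = np.tile(m[None, :], (side, 1))   # column k is filled when mask[k] == 1
        patterns.add(tuple(rows.ravel().tolist()))
        patterns.add(tuple(cols.ravel().tolist()))
    return np.array(sorted(patterns))           # the set removes the duplicated blank/full images

def labeled_shifter_ensemble(n_bits=8):
    """States: pattern + 3-bit code + shifted pattern (cyclic shift assumed)."""
    shifts = {(0, 0, 1): -1, (0, 1, 0): 0, (1, 0, 0): 1}   # left, none, right
    states = []
    for pattern in itertools.product([0, 1], repeat=n_bits):
        for code, shift in shifts.items():
            shifted = np.roll(pattern, shift)
            states.append(np.concatenate([pattern, code, shifted]))
    return np.array(states)

BS = bars_and_stripes()
LSE = labeled_shifter_ensemble()
```

The resulting arrays have one training state per row, with 16 columns for BS and 19 columns for LSE.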

In the following we discuss the learning processes of
these problems with binary RBMs, employing the Contrastive
Divergence algorithm CD$_n$ with $n = 1$ and $n = 10$ as
described in section II-C. In the BS case the RBM had 16
visible and 8 hidden units, while in the LSE problem these
numbers were 19 and 10, respectively. For the random data
sets we have used 10 hidden units in each case.

Every simulation was carried out for a total of 50000
epochs, with measures being taken every 50 epochs. Moreover,
every point in the subsequent plots was the average of ten
different simulations starting from different random values of
the weights and biases. Other parameters affecting the results
that were changed along the analysis are the learning rates
involved in the weight and bias update rules. No weight decay
was used, and momentum was set to 0.8. The learning rates
were chosen in order to make sure that the log-likelihood
degenerates, in such a way that it presents a clear maximum
that should be detected by $\xi_d$.
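
Once the monitored quantity has been recorded at regular intervals, the stopping point is simply the position of its maximum. The sketch below illustrates this with made-up values. It assumes that measurement number k is taken after k × 50 epochs and allows an initial transient to be skipped; the function name, the `skip` parameter and the sample curve are hypothetical.

```python
import numpy as np

def stopping_epoch(monitored, every=50, skip=0):
    """Epoch and index of the maximum of a monitored curve, ignoring the first `skip` points."""
    monitored = np.asarray(monitored, dtype=float)
    best = skip + int(np.argmax(monitored[skip:]))
    return best * every, best

# Made-up measurements of the monitored quantity, one every 50 epochs.
curve = [0.2, 0.9, 0.5, 0.7, 1.4, 1.1, 0.6]
epoch, index = stopping_epoch(curve, every=50, skip=2)
```

In practice one would also keep a copy of the weights and biases at each measurement, so that the model at the selected index can be recovered after training.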

In the following we perform two series of experiments that
are reported in the next two subsections. In the first one
(section IV-A) we analyze the case where all states in $D$
are included. In the second one (section IV-B) we relax the
computational cost of the evaluation of $\xi_d$ by selecting only a
small subset of all the states in $D$.

A. Complete Neighborhoods 

We present the results for the problems at hand, showing
for each analyzed instance different plots corresponding to the
actual log-likelihood of the problem and $\xi_d$ for different values
of $d$, among other things. In order to identify the contributions
to $\xi_d$ from the different neighborhoods of the training set, we
define two different sets: $D_a$, containing all states at a distance
less than or equal to $d$, and $D_s$, accounting for those states at a
distance exactly equal to $d$. We have computed $\xi_d$ for $D = D_a$
and $D = D_s$ in all the experiments discussed in
the following.

Fig. 1. Results for the RAN10 problem. The first column shows the log-likelihood (top) and the reconstruction errors (7) and (8) (center and bottom). The
other columns in the first, second and third rows depict $\xi_d$ for $D = D_a$, the sum of probabilities in the denominator of $\xi_d$ for $D = D_a$, and $\xi_d$ for $D = D_s$
for $d = 0, 1, 2, 3$, respectively. The x-axis is the number of epochs along the simulation divided by 50 in all plots. All data in the y-axis are in arbitrary units.

Fig. 2. Results for the RAN14 problem. The first column shows the log-likelihood and the reconstruction error (7) (top and bottom panels). The other
columns in the upper and lower rows show $\xi_d$ for $D = D_a$ and $\xi_d$ for $D = D_s$ for $d = 0, 1, 2, 3$, respectively.

Figure 1 shows our results for the RAN10 data set. The
upper left panel shows the log-likelihood of the data during
training. As can be seen, there is a clear maximum that
should be identified as the stopping point. The panels below
show the reconstruction errors (7) and (8), which clearly fail to
identify the desired extremum. The rest of the columns show
results for distances $d = 0, 1, 2$ and $3$. The first row depicts
$\xi_d$ for $D_a$, where all states at the required distances are taken
into account. As can be seen, starting at $d = 1$ the criterion
is robust and consistently detects the maximum of the
log-likelihood at the right place, thus reinforcing the idea that
the neighborhood of the data contains valuable information.
The second row shows the denominator of $\xi_d$ corresponding
to the first row, that is, the sum of probabilities of the states
included in each case. Notice that for $d = 3$ this sum equals
one and $\xi_d$ is exactly equal to the likelihood of the data. More
interestingly, even when the sum is still far away from one,
as happens for $d = 1$, $\xi_d$ consistently finds the desired
point. This behavior is also observed in the rest of the data
sets analyzed. Finally, the third row shows $\xi_d$ for $D_s$, thus
showing the behavior of the criterion applied to different
shells. For $d = 1$ and $2$ the criterion detects the maximum of
the log-likelihood reasonably well and can be used to identify
the desired stopping point. Notice, though, that the data alone,
entirely contained at $d = 0$, is not capable of reproducing this
behavior. Moreover, for $d$ larger than 2 the criterion also
fails, as it is expected that starting at a certain distance the
information regarding the model is lost. Please notice that
the initial transitory behavior of some of the plots above is
meaningless and has therefore been cut.

Fig. 4. Same as in figure 2 for the LSE data set.

Equivalent results for the RAN14 case are shown in figure 2.
The log-likelihood and the probabilistic reconstruction error
in (7) are depicted in the upper and lower panels in the first
column, respectively. The other panels show $\xi_d$ for $D_a$ and
$D_s$, with $d = 0, 1, 2, 3$ (top and bottom rows, second to fifth
columns). As in the previous case, the reconstruction error fails
to detect the maximum of the likelihood, thus not being very
useful in the present context. On the contrary, a stopping point
obtained from $\xi_d$ selects a near-optimal model. Notice that
the criterion is robust along all distances explored, as desired.
Similar results are found for the RAN12 and RAN16 cases.
As can be inferred from these results, the optimal value of
$d$ cannot be fixed beforehand and is problem-dependent.

The same plots for the BS and LSE problems are found in
figures 3 and 4. Once again, the reconstruction error decreases
monotonically and is therefore useless in the present context.
In contrast, $\xi_d$ for $d$ larger than 1 successfully does the task for
$D = D_a$, while for $D = D_s$ the criterion does not work in
the BS problem.


B. Incomplete Neighborhoods

Despite the success of the criterion built as proposed, it is
clear that for large spaces it can be impractical if the number
of states in the neighborhood of the training set is very large.
For that reason, we have tested the criterion on randomly
selected subsets $\tilde{D}_a \subset D_a$ of the same size as the training
set, which is always computationally tractable. In this sense,
we denote by $\tilde{\xi}_d$ the evaluation of $\xi_d$ on $\tilde{D}_a$. Figure 5 shows
$\tilde{\xi}_d$ compared with $\xi_d$ from the previous figures for the BS
(first row) and LSE (second row) problems. More precisely,
the first column shows the log-likelihood of the data along the
training process, while the rest of the columns plot both $\xi_d$
and $\tilde{\xi}_d$ for $d = 0, 1, 2$ and $3$. Notice that the absolute scales
of $\xi_d$ and $\tilde{\xi}_d$ may vary, mainly due to the value of the sum of
probabilities in the denominators. However, since the precise
value of these quantities is irrelevant, we have decided to scale
them properly for the sake of comparison. Although $\tilde{\xi}_d$ is built
from a much smaller set than $\xi_d$, it captures all the significant
features of $\xi_d$ and can therefore be used instead of it. In this
sense, $\tilde{\xi}_d$ provides a good stopping criterion for CD$_1$, although
it is not as robust as $\xi_d$ due to the strong reduction of states
contributing to $\tilde{\xi}_d$ as compared with those entering in $\xi_d$. This
reduction is illustrated in table I, where we show the number
of neighboring states to the data set at different distances for
the BS and LSE problems. By increasing the number of states
included in $\tilde{\xi}_d$, convergence to $\xi_d$ is expected at the expense of
an increase in computational cost. However, the present results
indicate that, at least for the problems at hand, a number of
examples similar to that of the training set in the evaluation
of $\tilde{\xi}_d$ is enough to detect the maximum of the log-likelihood
of the data.

TABLE I
Number of neighbors at different Hamming distances for the BS and LSE data sets.

Data Set                  Hamming Distance
                               1      2       3       4       5      6      7    8   9  10
Bars and Stripes             480   3216   11360   20744   19296   8688   1632   90   -   -
Labeled Shifter Ensemble    8434  41160  110326  165088  132976  54160  10368  966  40   2

Fig. 5. Comparison between $\xi_d$ (black curves) and $\tilde{\xi}_d$ (red curves) for the BS and LSE data sets (upper and lower rows). Notice that since the magnitude of
these parameters is irrelevant, some curves have been scaled for the sake of clarity. The first column plots the log-likelihood of the data along the simulation.

Fig. 6. Same as in figure 5 for the LSE problem with CD$_{10}$.
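
The counts reported in Table I for the Bars and Stripes problem can be checked by brute force, since the space has only $2^{16}$ states. The sketch below encodes every state as an integer, computes its minimum Hamming distance to the 30 training images and counts how many states fall at each distance. It is an illustrative check with assumed variable names, not the procedure used by the authors, and it is feasible only because the space is small.

```python
import itertools
import numpy as np

side = 4
n_bits = side * side

# Encode each Bars and Stripes image as a 16-bit integer.
codes = set()
for mask in itertools.product([0, 1], repeat=side):
    row_img = [mask[r] for r in range(side) for _ in range(side)]    # horizontal lines
    col_img = [mask[k] for _ in range(side) for k in range(side)]    # vertical lines
    for img in (row_img, col_img):
        codes.add(int("".join(map(str, img)), 2))
train = np.array(sorted(codes))

# Minimum Hamming distance from every state to the training set.
popcount = np.array([bin(i).count("1") for i in range(2 ** n_bits)])
states = np.arange(2 ** n_bits)
min_dist = popcount[states[:, None] ^ train[None, :]].min(axis=1)

# shell_sizes[d] = number of states at distance exactly d from the training set.
shell_sizes = np.bincount(min_dist)
```

Entry 0 of `shell_sizes` is the training set itself (30 images), and the entries for distances 1 and 2 are 480 and 3216, in agreement with the first two columns of the Bars and Stripes row.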

All the results presented up to this point show the goodness
of the proposed stopping criterion for learning in CD$_1$.
However, the underlying idea can be applied to different learning
algorithms that try to maximize the log-likelihood of the data.
In this way we have repeated all the previous experiments
for CD$_{10}$ with very similar results to the ones above. As an
example, figure 6 shows the log-likelihood, $\xi_d$ and $\tilde{\xi}_d$ with
$d = 0, 1, 2, 3$ and CD$_{10}$ for the LSE data set, which is the
largest one analyzed in this work. As is clearly seen, the
quality of the results is very similar to the CD$_1$ case, thus
stressing the robustness of the criterion.

As a final remark, we note that for the BS problem the
RBM trained and stopped using the proposed criterion is able to
generate samples qualitatively similar to those in the training
set. We show in figure 7 the complete training set (two upper
rows) and the same number of generated samples (two lower
rows) obtained from the RBM trained with CD$_1$ and stopped
after 5000 epochs, around the maximum shown by $\xi_1$,
which approximately coincides with the optimal value of the
log-likelihood. It is important to realize that, ultimately, the
quality of the model is a direct measure of the quality of CD$_1$
learning, and that the model used to generate the plots is the
one with largest $\xi_1$, which is quite close to the one with largest
likelihood.

Fig. 7. Training data (two upper rows) and generated samples (two lower rows) for the BS problem with the weights and biases obtained at the stopping
point detected by $\xi_d$ with $d = 1$.

V. Conclusions

In this work we have introduced the contribution of
neighboring points to the training set to build a stopping criterion for
learning in CD$_1$. We have shown that not only the training set
but also the neighboring states contain valuable information
that can be used to follow the evolution of the network along
training.

Based on the fact that learning tries to increase the
contribution of the relevant states while decreasing the contribution
of the rest, continuity and smoothness of the energy function
assign more probability to states close to the training data.
This is the key idea behind the proposed stopping criterion.
In fact, two different but related estimators (depending on the
number of states used to compute them) have been proposed
and tested experimentally. The first one includes all states
close to the training set, while the second one takes only a
fraction of these states as small as the size of the training
set. The first estimator is robust but may require the use
of a forbiddingly large number of states, while the second
one is always tractable and captures most of the features
of the first one, thus providing a suitable stopping criterion
for learning. This second estimator could be used in larger data
set problems, where an exact computation of the log-likelihood
is not possible. Additionally, the main idea of proximity to the
training set will be explored in other aspects related to learning
in future work.